
Remove get_start_timestamp_for_gpu_op from trace_linker.py #70

Merged

srinivas212 merged 1 commit into main from rm-get_start_timestamp_for_gpu_op on May 23, 2024

Conversation

@TaekyungHeo (Contributor) commented May 23, 2024

Summary

Remove `get_start_timestamp_for_gpu_op` from `trace_linker.py`. In `find_parent_cpu_op`, the timestamp of a GPU operator must be determined in order to identify the correct parent CPU operator. That timestamp is actually determined by the CUDA launcher operator that launched the GPU operator. We previously used `get_start_timestamp_for_gpu_op` for this purpose, but the function turned out to be both unnecessary and buggy. The bug stemmed from a limitation of external IDs: we relied on external IDs to match a GPU operator with a CPU operator, but external IDs are not guaranteed to match. The `correlation` field is a more reliable way to associate a GPU operator with its CUDA launcher operator, so this PR removes `get_start_timestamp_for_gpu_op`.
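For illustration, the correlation-based approach described above can be sketched as follows. This is a minimal, hypothetical example, not the actual `trace_linker.py` code: the helper name `link_gpu_ops_to_launchers`, the event dictionaries, and the category strings are assumptions loosely modeled on Kineto-style trace events, where a GPU kernel and the CUDA runtime call that launched it share a `correlation` id.

```python
# Illustrative sketch (NOT the real trace_linker.py implementation):
# match each GPU-side event to the CPU-side CUDA launcher event via the
# shared "correlation" id, and take the launcher's start timestamp.

def link_gpu_ops_to_launchers(kineto_events):
    """Return {gpu_op_name: launcher_start_ts} using correlation ids."""
    # First pass: index CUDA launcher (runtime) events by correlation id.
    launchers = {}
    for ev in kineto_events:
        if ev.get("cat") == "cuda_runtime":
            corr = ev.get("args", {}).get("correlation")
            if corr is not None:
                launchers[corr] = ev

    # Second pass: resolve each GPU kernel to its launcher's timestamp,
    # which is then used when searching for the parent CPU operator.
    linked = {}
    for ev in kineto_events:
        if ev.get("cat") == "kernel":
            corr = ev.get("args", {}).get("correlation")
            launcher = launchers.get(corr)
            if launcher is not None:
                linked[ev["name"]] = launcher["ts"]
    return linked


# Hypothetical two-event trace: one launcher, one kernel, same correlation.
events = [
    {"name": "cudaLaunchKernel", "cat": "cuda_runtime", "ts": 100,
     "args": {"correlation": 7}},
    {"name": "gemm_kernel", "cat": "kernel", "ts": 150,
     "args": {"correlation": 7}},
]
print(link_gpu_ops_to_launchers(events))  # {'gemm_kernel': 100}
```

Because the `correlation` id is written by the profiler for exactly one launcher/kernel pair, this lookup avoids the ambiguity that made external-ID matching unreliable.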

Test Plan

$ python3 ci_tools/integration_tests.py --tgz_path tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05.tgz --num_ranks 8 --tolerance 0.05 --expected_times_ms 14597 14597 14968 14638 14649 14700 14677 14735
Extracting tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05.tgz to tests/data/1.0.2-chakra.0.0.4
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_0.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_0.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_0.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_1.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_1.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_1.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_2.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_2.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_2.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_3.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_3.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_3.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_4.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_4.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_4.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_5.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_5.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_5.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_6.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_6.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_6.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_7.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_7.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_7.json
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_0.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_0.chakra --input_type PyTorch --log_filename /tmp/rank_0.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_1.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_1.chakra --input_type PyTorch --log_filename /tmp/rank_1.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_3.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_3.chakra --input_type PyTorch --log_filename /tmp/rank_3.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_2.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_2.chakra --input_type PyTorch --log_filename /tmp/rank_2.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_4.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_4.chakra --input_type PyTorch --log_filename /tmp/rank_4.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_6.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_6.chakra --input_type PyTorch --log_filename /tmp/rank_6.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_5.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_5.chakra --input_type PyTorch --log_filename /tmp/rank_5.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_7.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_7.chakra --input_type PyTorch --log_filename /tmp/rank_7.log
Validation successful for /tmp/rank_0.log: 14802300us is within the acceptable range.
Validation successful for /tmp/rank_1.log: 14785782us is within the acceptable range.
Validation successful for /tmp/rank_2.log: 15233261us is within the acceptable range.
Validation successful for /tmp/rank_3.log: 14878058us is within the acceptable range.
Validation successful for /tmp/rank_4.log: 14892945us is within the acceptable range.
Validation successful for /tmp/rank_5.log: 14993779us is within the acceptable range.
Validation successful for /tmp/rank_6.log: 14936348us is within the acceptable range.
Validation successful for /tmp/rank_7.log: 15031147us is within the acceptable range.

@TaekyungHeo TaekyungHeo requested a review from a team as a code owner May 23, 2024 00:24
@github-actions
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@srinivas212 srinivas212 merged commit 2b781a9 into main May 23, 2024
@github-actions github-actions bot locked and limited conversation to collaborators May 23, 2024
@TaekyungHeo TaekyungHeo deleted the rm-get_start_timestamp_for_gpu_op branch May 23, 2024 13:24
